Skip to content

Conversation

minglumlu
Copy link
Member

@minglumlu minglumlu commented Aug 28, 2025

This metric represents the rate of VM migrations with vGPUs per second.
The total count can be calculated as AVERAGE * seconds. For example,
for one-day data granularity, the total count for one day is
AVERAGE * 86400, in which 86400 is the seconds of one day.

let vgpu_migration_count_ds =
( Rrd.Host
, Ds.ds_make ~name:"pool_vgpu_migration_count"
~description:"Number of vGPU migrations occurred since last update"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this since last xapi start?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's reset on every iteration:

+    Atomic.exchange pool_vgpu_migration_count 0 |> Int64.of_int

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's reset on every update.

Copy link
Member

@psafont psafont Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording confused me because I thought it was since the last time the host's packages were updated.

Since this is a rate (both absolute and derive are), "since the last update" should be dropped.

  • Absolute: meaning that the incoming data is an absolute rate
  • Derive: meaning that the rate must come from the difference between the
    incoming data and the previous value

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so note that "since last rrd update" is visible to users and they might not know what an rrd update is. In the ned this metric is a rate, and users don't care how it's calculated: the counter is reset every 5 seconds, or a difference is taken between the current measurement and the last one

Copy link
Member

@robhoes robhoes Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the "unit" of this metric is something like "number of vGPU migrations per 5 seconds"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the number of migrations per second, like network bandwidth used. If we want a time-independent measure a gauge should be used, like in memory.

I do think a rate is OK to be used here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK cool, then we should reflect that in the description by adding "per second"? And ~units:"count" should be "migrations/s"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For comparison:

      , ds_make ~name:(key_format "iops_total")
          ~description:"I/O Requests per second"
          ~value:(Rrd.VT_Int64 (Int64.add value.iops_read value.iops_write))
          ~ty:Rrd.Absolute ~units:"requests/s" ~min:0. ()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so, yes

@@ -2501,6 +2501,10 @@ functor
let snapshot = Db.VM.get_record ~__context ~self:vm in
reserve_memory_for_vm ~__context ~vm ~host ~snapshot
~host_op:`vm_migrate (fun () ->
Db.VM.get_VGPUs ~__context ~self:vm
|> List.iter (fun _ ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a VM has multiple vGPUs, should we not still count it as a single vGPU migration?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. Counting as a single vGPU migration makes the meaning simpler.

Copy link
Member

@psafont psafont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit should be changed to migrations/s

@lindig
Copy link
Contributor

lindig commented Aug 28, 2025

If we want a rate, does it really make sense to pick migrations/s? Why not scale it to migrations/h or /day? Or would this be solved at a different layer?

@minglumlu
Copy link
Member Author

minglumlu commented Aug 29, 2025

The unit should be changed to migrations/s

But it's actually "migrations/5 secs". It would be complicated if an interval is recorded for the data in tracking. Or we just assume it is always 5 seconds, and divide it by 5 seconds before update it? I really don't want to expose the "5" seconds.


I updated it with a stop watch to make it a rate (per second) now.

If we want a rate, does it really make sense to pick migrations/s? Why not scale it to migrations/h or /day? Or would this be solved at a different layer?

Yeah, this will be solved at a different layer. The data can be consolidated as migrations/day.

@minglumlu minglumlu force-pushed the private/mingl/CP-308863 branch from eb5b3bc to 4a91d30 Compare August 29, 2025 05:19
@robhoes
Copy link
Member

robhoes commented Aug 29, 2025

But it's actually "migrations/5 secs". It would be complicated if an interval is recorded for the data in tracking. Or we just assume it is always 5 seconds, and divide it by 5 seconds before update it? I really don't want to expose the "5" seconds.

I assumed that xcp-rrdd divides by the time span of ~5s itself. So if you use an Absolute RRD type, it takes a value of unit x and it always becomes x/s. Is that right @psafont?

@psafont
Copy link
Member

psafont commented Aug 29, 2025

AFAIU, xcp-rrdd divides the counter by 5, when calculating the value, as part of the rrd library

@minglumlu
Copy link
Member Author

AFAIU, xcp-rrdd divides the counter by 5, when calculating the value, as part of the rrd library

Hmm. I can't find the dividing in rrd lib. Am I missing something? I think it should be in process_ds_value https://github.com/xapi-project/xen-api/blob/master/ocaml/libs/xapi-rrd/lib/rrd.ml#L348.

And the testing is showing that the input being divided is indeed as expected.

@psafont
Copy link
Member

psafont commented Sep 1, 2025

That sounds like an issue that might have been introduced by inhttps://github.com/xapi-project/xen-api/commit/73ca3cca49fd604a2cd408c332078535f0f694fe note how on the first chunknof code changed the behaviour of absolute is not changed, but in the second one, it's not treated as a rate anymore

@last-genius could you confirm the correct behaviour?

@last-genius
Copy link
Contributor

last-genius commented Sep 1, 2025

That sounds like an issue that might have been introduced by in 73ca3cc note how on the first chunknof code changed the behaviour of absolute is not changed, but in the second one, it's not treated as a rate anymore

@last-genius could you confirm the correct behaviour?

That does seem to be the case, Absolute shouldn't have been changed along with Gauge. The correct behaviour should be as follows:

 Values       = 300, 600, 900, 1200
 Step         = 300 seconds
 DERIVE DS    =    1,  1,   1,    1
 ABSOLUTE DS  =    1,  2,   3,    4
 GAUGE DS     = 300, 600, 900, 1200

But that was not the behaviour prior to 73ca3cc either, the following unit test (both on master and before 73ca3cc) has this output:

calculated 300.000000: submitted 300.000000
calculated 0.000000: submitted 600.000000
let absolute_rrd_CA_404597 () =
  let rra = rra_create CF_Average 100 1 0.5 in
  let ts = 1000000000.0 in
  let ds = ds_create "foo" Absolute (VT_Float 0.0) in
  let rrd = rrd_create [|ds|] [|rra|] 1L ts in
  for i = 1 to 100000 do
    let t = (300. *. float_of_int i) in
    let ((_, val1) as v1) =
      (0, {value= VT_Float (300.*.float_of_int i); transform= Identity})
    in
    ds_update rrd t [|v1|] false ;

    Printf.printf "calculated %f: submitted %f\n" rrd.rrd_dss.(0).ds_value
      (float_of_string (ds_value_to_string val1.value));
    compare_float __LOC__ (float_of_string (ds_value_to_string val1.value))
    rrd.rrd_dss.(0).ds_value
  done

So I'm completely confused....

@last-genius
Copy link
Contributor

The unit test output above is the same on v24.29.0, which is before any of the recent refactoring of RRD

@last-genius
Copy link
Contributor

last-genius commented Sep 2, 2025

This patch makes an updated unit test pass:

diff --git a/ocaml/libs/xapi-rrd/lib/rrd.ml b/ocaml/libs/xapi-rrd/lib/rrd.ml
index bb516ea6a..414e2cd63 100644
--- a/ocaml/libs/xapi-rrd/lib/rrd.ml
+++ b/ocaml/libs/xapi-rrd/lib/rrd.ml
@@ -361,7 +361,9 @@ let process_ds_value ds value interval new_rrd =

     let rate =
       match (ds.ds_ty, new_rrd) with
-      | Absolute, _ | Derive, true | Gauge, _ ->
+      | Absolute, _ ->
+          value_raw /. interval
+      | Derive, true | Gauge, _ ->
           value_raw
       | Derive, false -> (
         match (ds.ds_last, value) with

test, had to adjust mrhb and starting timestamp:

let absolute_rrd_CA_404597 () =
  let rra = rra_create CF_Average 100 1 0.5 in
  let ts = 0.0 in
  let ds =
    ds_create "foo" Absolute ~mrhb:1000.0 ~min:0. ~max:infinity (VT_Float 0.0)
  in
  let rrd = rrd_create [|ds|] [|rra|] 1L ts in
  let id = Identity in
  for i = 1 to 100000 do
    let t = 300. *. float_of_int i in
    let ((_, val1) as v1) =
      (0, {value= VT_Float (300. *. float_of_int i); transform= id})
    in
    ds_update rrd t [|v1|] false ;

    compare_float __LOC__
      (float_of_string (ds_value_to_string val1.value) /. 300.)
      rrd.rrd_dss.(0).ds_value
  done

@lindig
Copy link
Contributor

lindig commented Sep 2, 2025

Your patch @last-genius is independent of this new metric - is that right? So maybe we should merge it from it's own PR?

@last-genius
Copy link
Contributor

Your patch @last-genius is independent of this new metric - is that right? So maybe we should merge it from it's own PR?

certainly, but it would change the behaviour of the new metric (well, fixing it to work as intended). I'll open a PR

@robhoes
Copy link
Member

robhoes commented Sep 2, 2025

Is interval here a value that nowadays comes from actual sample timestamps rather than being hardcoded to 5?

@last-genius
Copy link
Contributor

Is interval here a value that nowadays comes from actual sample timestamps rather than being hardcoded to 5?

indeed

@minglumlu minglumlu force-pushed the private/mingl/CP-308863 branch from 4a91d30 to 642b5c3 Compare September 3, 2025 07:27
github-merge-queue bot pushed a commit that referenced this pull request Sep 3, 2025
Absolute metric should work as follows:

    Timestamp    = 300, 600, 900, 1200
    Step         = 300 seconds
    ABSOLUTE DS  =    1,  2,   3,    4

But they do not seem to have ever worked correctly - they were
previously divided by the interval but incorrectly handled NaNs,
resulting in wrong behaviour. Then the refactoring (including 73ca3cc
("CA-404597: rrd - Pass Gauge and Absolute data source values as-is"))
broke them further.

Divide absolute metrics by the interval in process_ds_value. Fix the
unit test to expect the right behaviour - it passes with this fix.

---

This means absolute metrics are now correctly calculated as unit/s,
changing the behaviour in
#6640

I am not 100% sure this fix goes all the way either - unit tests should
cover expected values in RRAs as well, and the logic is still really
confusing to me.
This metric represents the rate of VM migrations with vGPUs per second.
The total count can be calculated as AVERAGE * seconds. For example,
for one-day data granularity, the total count for one day is
AVERAGE * 86400, in which 86400 is the seconds of one day.

Signed-off-by: Ming Lu <[email protected]>
@minglumlu minglumlu force-pushed the private/mingl/CP-308863 branch from 642b5c3 to b657c62 Compare September 5, 2025 05:32
@minglumlu
Copy link
Member Author

minglumlu commented Sep 5, 2025

Hi @psafont
Are you happy with this PR? I manually tested with #6646. It works as expected now :)

https://10.70.48.96/rrd_updates?session_id=OpaqueRef:cdf1aff3-4cd5-2323-cde8-3a1595c39e4d&start=1757043242&host=true&cf=AVERAGE&interval=60
2025-09-05 13:34:00: 0 (original float: 0.0)
2025-09-05 13:33:00: 0 (original float: 0.0)
2025-09-05 13:32:00: 0 (original float: 0.0)
2025-09-05 13:31:00: 2 (original float: 1.999984458088872)
2025-09-05 13:30:00: 0 (original float: 0.0)
2025-09-05 13:29:00: 0 (original float: 0.0)
2025-09-05 13:28:00: 0 (original float: 0.0)
2025-09-05 13:27:00: 0 (original float: 0.0)
2025-09-05 13:26:00: 2 (original float: 2.002933993935588)
2025-09-05 13:25:00: 1 (original float: 0.9993496164679501)
2025-09-05 13:24:00: 0 (original float: 0.0)
2025-09-05 13:23:00: 1 (original float: 1.0000780597329122)
2025-09-05 13:22:00: 0 (original float: 0.0)
2025-09-05 13:21:00: 0 (original float: 0.0)
  • The "original float" is the <rrd value> * <interval>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants